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Abstract 

Super-resolution is the problem of recovering a superposition of point sources using bandlim- 
ited measurements, which may be corrupted with noise. This signal processing problem arises 
in numerous imaging problems, ranging from astronomy to biology to spectroscopy, where it is 
common to take (coarse) Fourier measurements of an object. Of particular interest is in ob¬ 
taining estimation procedures which are robust to noise, with the following desirable statistical 
and computational properties: we seek to use coarse Fourier measurements (bounded by some 
cutoff frequency); we hope to take a (quantifiably) small number of measurements; we desire 
our algorithm to run quickly. 

Suppose we have k point sources in d dimensions, where the points are separated by at least 
A from each other (in Euclidean distance). This work provides an algorithm with the following 
favorable guarantees: 

• The algorithm uses Fourier measurements, whose frequencies are bounded by 0(1/A) (up 
to log factors). Previous algorithms require a cutoff frequency which may be as large as 

n{Vd/A). 

• The number of measurements taken by and the computational complexity of our algorithm 
are bounded by a polynomial in both the number of points k and the dimension d, with 
no dependence on the separation A. In contrast, previous algorithms depended inverse 
polynomially on the minimal separation and exponentially on the dimension for both of 
these quantities. 

Our estimation procedure itself is simple: we take random bandlimited measurements (as op¬ 
posed to taking an exponential number of measurements on the hyper-grid). Furthermore, our 
analysis and algorithm are elementary (based on concentration bounds for sampling and the 
singular value decomposition). 
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1 Introduction 


We follow the standard mathematical abstraction of this problem (Candes &: Fernandez-Granda 
mi): consider a d-dimensional signal x{t) modeled as a weighted sum of k Dirac measures in 

k 

xit) = '^Wj6^U), ( 1 ) 

where the point sources, the are in Assume that the weights Wj are complex valued, 

whose absolute values are lower and upper bounded by some positive constant. Assume that we 
are given k, the number of point source^ 

Define the measurement function f{s) : —>■ C to be the convolution of the point source x{t) 

with a low-pass point spread function as below: 


f{s) = 




i=i 


In the noisy setting, the measurements are corrupted by uniformly bounded perturbation z: 

lis) = f{s) + z{s), |z(s)| < e 2 ,Vs. 


( 2 ) 


( 3 ) 


Suppose that we are only allowed to measure the signal x{t) by evaluating the measurement 
function f{s) at any s G M'^, and we want to recover the parameters of the point source signal, i.e., 
: j G [fe]}. We follow the standard normalization to assume that: 

G [-1,+!]'^, |u;j|g[0,1] Vj G [A:]. 

Let Wmin = niinj \ wj\ denote the minimal weight, and let A be the minimal separation of the point 
sources defined as follows: 


A = min ^|| 2 , (4) 

j¥=j' 

where we use the Euclidean distance between the point sources for ease of expositiorj^ These 
quantities are key parameters in our algorithm and analysis. Intuitively, the recovery problem is 
harder if the minimal separation A is small and the minimal weight Wmin is small. 

The first question is that, given exact measurements, namely = 0, where and how many 
measurements should we take so that the original signal x{t) can be exactly recovered. 

Definition 1.1 (Exact recovery). In the exact case, i.e. = 0, we say that an algorithm achieves 
exact recovery with m measurements of the signal x{t) if, upon input of these m measurements, the 
algorithm returns the exact set of parameters : j G [A:]}. 

Moreover, we want the algorithm to be measurement noise tolerant, in the sense that in the 
presence of measurement noise we can still recover good estimates of the point sources. 

^An upper bound of the number of point sources suffices. 

^Our ciaims hoid withut using the “wrap around metric”, as in HE], due to our random sampling. Also, it is 
possible to extend these results for the £p-noTm case. 
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Definition 1.2 (Stable recovery). In the noisy case, i.e., > 0, we say that an algorithm achieves 

stable recovery with m measurements of the signal x{t) if, upon input of these m measurements, 
the algorithm returns estimates : j G [fe]} such that 

min max 111/2*'-^^ — : j G [fc]| < poly{d,k)ez, 

where the min is over permutations vr on [k] and poly(d,k) is a polynomial function in d and k. 

By definition, if an algorithm achieves stable recovery with m measurements, it also achieves 
exact recovery with these m measurements. 

The terminology of “super-resolution” is appropriate due to the following remarkable result 
(in the noiseless case) of Donoho [9j: suppose we want to accurately recover the point sources to 
an error of 7, where 7 <C A. Naively, we may expect to require measurements whose frequency 
depends inversely on the desired the accuracy 7. Donoho [9] showed that it suffices to obtain a 
finite number of measurements, whose frequencies are bounded by 0(1/A), in order to achieve exact 
recovery; thus resolving the point sources far more accurately than that which is naively implied by 
using frequencies of 0(1/A). Furthermore, the work of Candes & Fernandez-Granda [11[3] showed 
that stable recovery, in the univariate case {d = 1), is achievable with a cutoff frequency of 0(1/A) 
using a convex program and a number of measurements whose size is polynomial in the relevant 
quantities. 


1.1 This work 


We are interested in stable recovery procedures with the following desirable statistical and com¬ 
putational properties: we seek to use coarse (low frequency) measurements; we hope to take a 
(quantifiably) small number of measurements; we desire our algorithm run quickly. Informally, our 
main result is as follows: 


Theorem 1.3 (Informal statement of Theorem 3.2). For a fixed probability of error, the proposed 
algorithm achieves stable recovery with a number of measurements and with computational runtime 
that are both on the order of 0{{klog{k) +d)‘^). Furthermore, the algorithm makes measurements 
which are bounded in frequency by 0(1/A) (ignoring log factors). 


Notably, our algorithm and analysis directly deal with the multivariate case, with the univariate 
case as a special case. Importantly, the number of measurements and the computational runtime 
do not depend on the minimal separation of the point sources. This may be important even 
in certain low dimensional imaging applications where taking physical measurements are costly 
(indeed, super-resolution is important in settings where A is small). Furthermore, our technical 
contribution of how to decompose a certain tensor constructed with Fourier measurements may be 
of broader interest to related questions in statistics, signal processing, and machine learning. 


1.2 Comparison to related work 

Table summarizes the comparisons between our algorithm and the existing results. The multi¬ 
dimensional cutoff frequency we refer to in the table is the maximal coordinate-wise entry of any 
measurement frequency s (i.e. ||s||oo)- “SDP” refers to the semidefinite programming (SDP) based 
algorithms of Candes Sz Fernandez-Granda [am; in the univariate case, the number of measure¬ 
ments can be reduced by the method in Tang et. al. |23) (this is reflected in the table). “MP” 
refers to the matrix pencil type of methods, studied in H! and |15j for the univariate case. Here, 
we are defining the infinity norm separation as Aqo = miiij^j/ ^||oo; which is understood 
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d = 1 

d > 1 


cutoff freq 

measurements 

runtime 

cutoff freq 

measurements 

runtime 

SDP 

1 

A 

klog{k) log(^) 

poly{^,k) 

Ch 

^oo 

(A_)d 


MP 

1 

A 

1 

A 

nr 

- 

- 

- 

Ours 

1 

A 

{klog{k)f 

{klog{k)f 

log(fcd) 

A 

(A:log(A;) + d)^ 

(A:log(A:) + d)^ 


Table 1: See Section [L^ for description. 
Here, we are implicitly using O(-) notation. 


See Lemma 3.3 for details about the cutoff frequency. 


as the wrap around distance on the unit circle. > 1 is a problem dependent constant (discussed 
below). 

Observe the following differences between our algorithm and prior work: 

1) Our minimal separation is measured under the .^ 2 -iiorm instead of the infinity norm, as in the 
SDP based algorithm. Note that Aqo depends on the coordinate system; in the worst case, it 
can underestimate the separation by a l/\/d factor, namely Aqo ~ A/y/d. 

2) The computation complexity and number of measurements are polynomial in dimension d and 
the number of point sources k, and surprisingly do not depend on the minimal separation of the 
point sources! Intuitively, when the minimal separation between the point sources is small, the 
problem should be harder, this is only reflected in the sampling range and the cutoff frequency 
of the measurements in our algorithm. 

3) Furthermore, one could project the multivariate signal to the coordinates and solve multiple 
univariate problems (such as in |19L I17j . which provided only exact recovery results). Naive 
random projections would lead to a cutoff frequency of 0{\/d/A). 

SDP approaches: The work in |3llH[T0] formulates the recovery problem as a total-variation 
minimization problem; they then show the dual problem can be formulated as an SDP. They focused 
on the analysis of d = 1 and only explicitly extend the proofs for d = 2. For d > 1, Ingham-type 
theorems (see |201[I2]) suggest that = 0(\/d). 

The number of measurements can be reduced by the method in [23] for the d = 1 case, which is 
noted in the table. Their method uses sampling “off the grid”; technically, their sampling scheme 
is actually sampling random points from the grid, though with far fewer measurements. 

Matrix pencil approaches: The matrix pencil method, MUSIC and Prony’s method are es¬ 
sentially the same underlying idea, executed in different ways. The original Prony’s method directly 
attempts to find roots of a high degree polynomial, where the root stability has few guarantees. 
Other methods aim to robustify the algorithm. 

Recently, for the univariate matrix pencil method, Liao & Fannjiang [TTj and Moitra m provide 
a stability analysis of the MUSIC algorithm. Moitra m studied the optimal relationship between 
the cutoff frequency and A, showing that if the cutoff frequency is less than 1/A, then stable 
recovery is not possible with matrix pencil method (with high probability). 

1.3 Notation 

Let M, C, and Z to denote real, complex, and natural numbers. For d G Z, [d] denotes the set 
[d] = {!,..., d}. For a set S, |5| denotes its cardinality. We use © to denote the direct sum of sets. 
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namely 5i © ^2 = {(a + 6) : a £ Si, b £ 52 }. 

Let Cn to denote the n-th standard basis vector in for n £ [d]. Let 2 = {a^£l^'^:|l^ll2 = 
1} to denote the d-sphere of radius R in the d-dimensional standard Euclidean space. 

Denote the condition number of a matrix X £ as cond 2 (X) = (Tmax{X)/amin{X), where 

(^max{X) and amin{X) are the maximal and minimal singular values of X. 

We use © to denote tensor product. Given matrices A,B,C £ the tensor product 

V = A 0 B 0 C £ is equivalent to = Yln=i ^h^nBi^^nCi^^n- Another view of 

tensor is that it defines a multi-linear mapping. For given dimension myi,mB,mc the mapping 
V{-, ■, ■) : X crnxms x ^ ^m^xm^xmc ig defined as: 

[V(Xjl, Xb, Xc)]i-^^i2,i^ ^ ^ ^'l,i2J3[^^]il,*l[A's]j2,*2[A'c]j3,j3- 

jl J2 J3e[m] 

In particular, for a £ C”^, we use V{I,I,a) to denote the projection of tensor V along the 3rd 
dimension. Note that if the tensor admits a decomposition V = A (S> B (S> C, it is straightforward 
to verify that 

V{I,I,a) = ADiagiC'^ a)B'^. 

It is well-known that if the factors A, B, C have full column rank then the rank k decomposition 
is unique up to re-scaling and common column permutation. Moreover, if the condition number 
of the factors are upper bounded by a positive constant, then one can compute the unique tensor 
decomposition V with stability guarantees (See [T] for a review. Lemma 
explicit statement.). 

2 Warm-up 

2.1 1-D case: revisiting the matrix pencil method 

Let us first review the matrix pencil method for the univariate case, which stability was recently 
rigorously analyzed in Liao & Fannjiang mi and Moitra m- 

A square matrix H is called a Hankel matrix if its skew-diagonals are constants, namely Hi^ = 
For some positive constants m G Z, sample to get the measurements f{s) evaluated at 
the sampling set ^3 = {0,1,..., 2m}, and construct two Hankel matrices Hq, Hi £ 


Ho = 

1 

/(I) • 
/(2) . 

.. /(m-1) 

f{m) 

, Hi = 

1_ 

/(2) . 

/(3) . 

/(m) 

• f{m + l) 


. f{m - 1) 

/(m) . 

.. /(2m - 1) _ 


. f{m) 

/(m + 1) . 

/(2m) 


3.5 herein provides an 


Define to be the diagonal matrix with the weights on the main diagonal: [D^]jj = 

vij. Define 6 CjjJ to be [DJ,,, = e‘w“, 

A matrix V is called a Vandermonde matrix if each column is a geometric progression, defined 
the Vandermonde matrix Vm £ as below: 


14,, = 


(giTTAtd))! 


^glTT^ 1 




( 6 ) 
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The two Hankel matrices Hq and Hi admit the following simultaneous diagonalization: 

Ho = V^D^V^, Hi = VmD^D^vZ- (7) 

As long as Vm is of full rank, this simultaneous diagonalization can be computed by solving the 
generalized eigenvalue problem, and the parameters of the point source can thus be obtained from 
the factor Vm and 

The univariate matrix pencil method only needs m > k to achieve exact recovery. In the 
noisy case, the stability of generalized eigenvalue problem depends on the condition number of the 
Vandermonde matrix Vm and the minimal weight Wmin- 

Since all the nodes ’s) of this Vandermonde matrix lie on the unit circle in the complex 

plane, it is straightforward to see that asymptotically limm-^oo cond 2 (Vm) = 1. Furthermore, for 
m > 1/A, [m [15] showed that cond 2 (Vn) is upper bounded by a constant that does not depend 
on k and m. This bound on condition number is also implicitly discussed in m- 

Another way to view the matrix pencil method is that it corresponds to the low rank 3rd order 
tensor decomposition (see for example [T] ). This view will help us generalize matrix pencil method 
to higher dimension d in a direct way, without projecting the signal on each coordinate and apply 
the univariate algorithm multiple times. For m > k, construct a 3rd order tensor F G ^mxmx 2 
with elements of Hq and Hi defined in ([^ as: 

Vj G [ 2 ],i,i' G [m]. 

Note that the two slices along the 3rd dimension of F are Hq and Hi. Namely F{I,I,ei) = Hq, 
and T(/,/, 62 ) = Hi. Recall the matrix decomposition of Hq and Hi in ([^. Since m>k and the 
/i*'-^)’s are distinct, we know that F has the unique rank k tensor decomposition: 

F = Vm®Vm®{V2Du,). 

Given the tensor F, the basic idea of the well-known Jennrich’s algorithm mm) for finding 
the unique low rank tensor decomposition is to consider two random projections vi,V 2 G M™, 
and then with high probability the two matrices F{I,I,vi) and T(I,/, U 2 ) admit simultaneous 
diagonalization. Therefore, the matrix pencil method is indeed a special case of Jennrich’s algorithm 
by setting vi = ei and V 2 = 62 

2.2 The multivariate case: a toy example 

One could naively extend the matrix pencil method to higher dimensions by using taking measure¬ 
ments from a hyper-grid, which is of size exponential in the dimension d. We now examine a toy 
problem which suggests that the high dimensional case may not be inherently more difficult than 
the univariate case. 

The key ideas is that an appropriately sampled set can significantly reduce the number of 
measurements (as compared to using all the grid points). Tang et al |23) made a similar observation 
for the univariate case. They used a small random subset of measurements (actually still from the 
grid points) and showed that this contains enough information to recover all the measurement on 
the grid; the full measurements were then used for stably recovering the point sources. 

Consider the case where the dimension d > k. Assume that Wj^s are real valued, and for all 
j G [k] and n G [d], the parameters /j.n'^ are i.i.d. and uniformly distributed over [—1,-|-1]. This 
essentially corresponds to the standard (L 2 ) incoherence conditions (for the /r^-^^’s).n The following 
simple algorithm achieves stability with polynomial complexity. 

^ This setting is different from the 2-norm separation condition. To see the difference, note that the toy algorithm 
does not work for constant shift -|- A. This issue is resolved in the general algorithm, when the condition 

is stated in terms of 2-norm separation. 
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First, take number of measurements by evaluating /(s) in the set ^3 = {s = 6^ + en 2 + fins : 
[ni, n 2 , na] G [d] x [d] x [d]}, noting that ^3 contains only a subset of d? points from the grid of [3]^. 
Then, construct a 3rd order tensor F G ([^dxdxd measurements in the following way: 

|, Vni,n2,n3 G [d]. 

Note that the measurement/(ei+e 2 +e 3 ) = e™^3\ 

It is straightforward to verify that F has a rank-A; tensor factorization F = Ki® (V^Z)^), where 
the factor 14 G is given by: 


• (1) 

■ (fc) 

• (1) 

■ (fc) 

• (1) 

. (k) 


( 8 ) 


\J ) 

Under the distribution assumption of the point sources, the entries are i.i.d. and uniformly 

distributed over the unit circle on the complex plane. Therefore almost surely the factor I 4 has 
full column rank, and thus the tensor decomposition is unique. Moreover here tCj’s are real and 
each element of Vs has unit norm, we have a rescaling constraint with the tensor decomposition, 
with which we can uniquely obtain the factor Vs and the weights in D^. By taking element-wise 
log of Vs we can read off the parameters of the point sources from Vs directly. Moreover, with 
high probability, we have that cond 2 (I 4 ) concentrates around 1, thus the simple algorithm achieves 
stable recovery. 


3 Main Results 

3.1 The algorithm 

We briefly describe the steps of Algorithm below: 

(Take measurements) Given positive numbers m and R, randomly draw a sampling set S = 
of m i.i.d. samples of the Gaussian distribution J\f{Q,R?Idxd)- Form the set 
5' = 5 U = ei,.. ., = Crf, = 0} C M'^. Denote m' = m + d + 1. Take 

another independent random sample v from the unit sphere, and define = v, = 2v. 
Gonstruct the 3rd order tensor F G C™ with noise corrupted measurements /(s) evaluated 

at the points in 5' © 5' © arranged in the following way: 

Fnun2,n3 = /(■s) L=^{ni)+^(n 2 )+^(r. 3 ) , Vui , 712 G [m'],n3 G [2]. (9) 


(Tensor decomposition) Define the characteristic matrix Vs to be: 









(10) 
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Input: i2, m, noisy measurement function /(•). 

Output: Estimates : j G [A:]}. 

1. Take measurements: 

Let S = ..., be m i.i.d. samples from the Gaussian distribution A/’(0, R'^I^xd)- 

Set = Bn for all n G [d] and = Q. Denote m' = m + d + 1. 

Take another random samples v from the unit sphere, and set = v and = 2v. 
Construct a tensor F G -Fni,n 2 ,n 3 = /(s)L=s(ni)+^(n 2 )+i,(" 3 ) ■ 

2. Tensor Decomposition: Set {Vs',Dw) = TensorDecomp(F). 

For j = 1,... , k, set [^ 5 /]^ = [Vs']j/[Vs']m',j 

3. Read of estimates: For j = 1,... ,k, set = Real{log{[Vs][m+i-.m+d,j])/■ 

4. Set W = argmin^g^fc ll-P" — ® Vs' ® VdDy^Wp- 

Algorithm 1: General algorithm 


and define matrix V' G to be 


where G is defined in Q. 

G2 = 


F 5 , 


Define 


G5 

Vd 


gi7r</iO) > 


1 ... 1 


( 11 ) 


Note that in the exact case (cz 


0) the tensor F constructed in Q 


admits a rank-A; decomposition: 


F = F 5 / (g) Vs' <S> (V2 Dw), 


( 12 ) 


Assume that Vs' has full column rank, then this tensor decomposition is unique up to column 
permutation and rescaling with very high probability over the randomness of the random unit 
vector V. Since each element of Vgi has unit norm, and we know that the last row of Vs' and the 
last row of F 2 are all ones, there exists a proper scaling so that we can uniquely recover rcj’s and 
columns of Vs' up to common permutation. 

In this paper, we adopt Jennrich’s algorithm (see Algorithm]^ for tensor decomposition. Other 
algorithms, for example tensor power method m) and recursive projection ([23]), which are 
possibly more stable than Jennrich’s algorithm, can also be applied here. 


(Read off estimates) Let log(Vrf) denote the element-wise logarithm of Vd- The estimates of the 
point sources are given by: 





( 2 ) 



log(Gd) 

ITT 










Input: Tensor F G £rnxmx3^ rank k. 
output: Factor V G 

1. Compute the truncated SVD of F{I, I, ei) = PAP^ with the k leading singular values. 

2. Set E = F{P,PJ). Set El = E{I, I, ei) and E 2 = E{I, I, 62 ). 

3. Let the columns of U be the eigenvectors of EiE^^ corresponding to the k eigenvalues witf 
the largest absolute value. 

4. Set V = ^PU. 

Algorithm 2: TensorDecomp 


Remark 3.1. In the toy example, the simple algorithm corresponds to using the sampling set 
S' = {ei,...,erf}. The conventional univariate matrix pencil method corresponds to using the 
sampling set S' = {0,1, ..., m] and the set of measurements S' ® S' Q) S' corresponds to the grid 


my 


3.2 Guarantees 

In this section, we discuss how to pick the two parameters m and R and prove that the proposed 
algorithm indeed achieves stable recovery in the presence of measurement noise. 

Theorem 3.2 (Stable recovery). There exists a universal constant C such that the following holds. 
Fix ex,Ss,Sy G (0, 5 ); 

pick m such that m > max | ^ log ^, dj; 
for d = 1, pick R > \og{^+ 2 jj^ _ ^ 

Assume the bounded measurement noise model as in @ and that (l+if 

With probability at least (1 — 5s) over the random sampling of S, and with probability at least 
(1 —(5^) over the random projections in Algorithm^ the proposed Algorithm^returns an estimation 
of the point source signal x{t) = ^ 3 ^pi) accuracy: 


> 2 , 


2.5 


mm max 




w} 


< c 


Vdk^ 


w„ 


A5y 


wt 


1 + 262 
1 - 2e, 


2.5 


where the min is over permutations tt on [k]. Moreover, the proposed algorithm has time complexity 
in the order of 0{{m')^). 


Proof, (of Theorem 3.2) The algorithm is correct if the tensor decomposition in Step 2 is unique, 


and achieves stable recovery if the tensor decomposition is stable. By the stability Lemma of tensor 
decomposition (Lem ma 


3.1 


follows from Lemma 


By the main technical 


3.5), this is guaranteed if we can bound the condition number of Vs'. It 


that the condition number of V 5 / is at most \/1 + Vk times of cond 2 (V 5 '). 


emma (Lemma 3.10) we know that with the random sampling set S of 


size m, the condition number cond 2 (V 5 ) is upper bounded by a constant. Thus we can bound the 


distance between Vs' and the estimation Vs' according to (13) 
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Since we adopt Jennrich’s algorithm for the low rank tensor decomposition, the overall com¬ 
putation complexity is roughly the complexity of SVD of a matrix of size m' x m', namely in the 
order of 0((m')^). □ 

The next lemma shows that essentially, with overwhelming probability, all the frequencies taken 
concentrate within the hyper-cube with cutoff frequency R' on each coordinate, where R' is com¬ 
parable to R, 

Lemma 3.3 (The cutoff frequency). For d > 1, with high probability, all of the 2(m')^ sampling 
frequencies in 5'©5'© satisfy that ^ Vji, ^2 £ [?^]) Js £ [2], 

where the per-coordinate cutoff frequency is given by R' = 0{RyJ\og md). 

For d = 1 case, the cutoff frequency R' can be made to be in the order of R' = 0(1/A). 

Proof. For d > 1 case, with straightforward union bound over the m' = 0{k‘^) samples each of 
which has d coordinates, one can show that the cutoff frequency is in the order of R\/\og{kd), 

where R is in the order of as shown in Theorem 

For d = 1 case, we bound the cutoff frequency with slightly more careful analysis. Instead of 
Gaussian random samples, consider uniform samples from the interval [—R',R']. We can modify 
the proof of Lemma |3.9| and show that if R' > 1/(A(1 + Cx)): 

IT I = — f ^ sin(7r|/r(-^') - pf^'>\R') 

-A 

sin(7rAd?')/(7rAi?') 

1 — sin(7rAd?')/(7rAd2') “ ^ 

where the second last inequality uses the inequality that ^ 1 ^ 

Remark 3.4 (Failure probability). Overall, the failure probability consists of two pieces: 6v for 
random projection of v, and 6s for random sampling to ensure the bounded condition number of Vs- 
This may be boosed to arbitrarily high probability through repetition. 


^ sin(Z7rAd2') ^ 


1=1 


{IttAR') 


3.2 


3.3 Key Lemmas 


Stability of tensor decomposition: In this paragraph, we give a brief description and the 
stability guarantee of the well-known Jennrich’s algorithm mm) for low rank 3rd order tensor 
decomposition. We only state it for the symmetric tensors as appeared in the proposed algorithm. 

Consider a tensor F = V ®V ® (I^Uio) G ©™.x").x3 factor V has full column rank k. 

Then the decomposition is unique up to column permutation and rescaling, and Algorithm finds 
the factors efficiently. Moreover, the eigen-decomposition is stable if the factor V is well-conditioned 
and the eigenvalues of FaF^ are well separated. 


Lemma 3.5 (Stability of Jennrich’s algorithm). Consider the 3rd order tensor F = V ® V ® 
{y 2 F>w) £ (C™-xmx3 qJ k <m, constructed as in Step 1 in Algorithm 1. 

Given a tensor F that is element-wise close to F, namely for all ni,n 2 ,n^ G [m], |Tni,n 2 ,n 3 — 


F, 


ni,n 2 ,n 3 | < (-z, and assume that the noise is small 


^ rmn 


, Use F as the input to 

_ cond2(V)^ 

Algorithm^ With probability at least (1 — (5„) over the random projections and v^‘^\ we can 
bound the distance between columns of the output V and that of V by: 

2 , 


mm max ■ 

TT j 


\^3 ^(i) II2 ■ j ^ 


w} 


< C 


^/dF- Wmax 
Adu wf- 


cond2{Vy€z, 


(13) 
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where C is a universal constant. 


Proof, (of Lemma 3.5) The proof is mostly based on the arguments in [161 [2], we still show the 
clean arguments here for our case. 

We first introduce some notations for the exact case. Define Di = diag([D 2 ]i,:Tl,«) and D 2 = 
diag([V 2 ] 2 ,:D^). Recall that the symmetric matrix Fi = T(/, I, ei) = VDiV~^. Consider its SVD 
Fi = PA.P^. Denote U = P^V G Define the whitened rank-A: tensor 


kxkx3 


E = F{P, P, I) = {P'V)0{P'V)<S> {V2D^) = U®U® {V2D^) G C 

Denote the two slices of the tensor Ehy Ei = £'(/, /, ei) = UDiU~^ and E 2 = E(I, I, 62 ) = UD 2 U~^. 
Define M = EiEf^, and its eigen decomposition is given by M = UDU~^, where D = DiDf^. 
Note that in the exact case, D is given by: 

D = : j G [k]) 

Note that \Djj\ = 1 for all j. Define the minimal separation of the diagonal entries in D to be: 

sep{D) = min{min|Dj j — D,/ 
j¥=j' 

1. We first apply perturbation bounds to show that the noise in E propagates the estimates P 
and i? in a mild way when the condition number of V is bounded by a constant. 

Proof. Apply Wedin’s matrix perturbation bound, we have: 

||Ti -F 1 II 2 




< 


m 




And then for the two slices of if = F{P, P, I), namely Ei = Ei + Zi for f = 1, 2, we can bound the 
distance between estimates and the exact case, namely Zi = P^FiP — P~^EiP, by: 


1^*11 <8||Fi||||P||||P-P||+4||P||2||T--Fill < 16 


Wmax 


Wr 


cond2(R)^ezVm 


□ 


2. Then, recall that M = EiE^ ^ = UDU Note that 

M = (Fi + Zi){E2 + Z2)-^ = EiEf^{I - Z2{I + Ef^Z2)-^E:^^) + ZiEf\ 


Let H and G denote the perturbation matrices: 

H = -Z2{I + F 2 “^F 2 )-^F 2 "\ G = ZiE^\ 


In the following claim, we show that given M = EiEf^^ = M{I+H)+G for some small perturbation 
matrix H and G, if the perturbation ||P|| and ||G|| are small enough and that sep{D) is large enough, 
the eigen decomposition M = UDU~^ is close to that of M. 


Claim 3.6. If\\MH + a\\ < 

the columns of U and U by: 


then the eigenvalues of M are distinct and we can bound 


min max IIP,- — Pn-cnlb 

TT j u; 


< 3 


max {D)+a 

max (G), 


Gr 


,{U)sep{D) 


Ui 


2 - 
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Proof. Let \j and Uj for j G \k] denote the eigenvalue and corresponding eigenvectors of M. If 

+ ^11 ^ 2A7!2(uy 

||M - M|| = \\U-^{M + {MH + G))U - L»|| = \\U-^{MH + G)U\\ < sep{D)/2Vk, 


thus apply Gershgorin’s disk theorem, we have IAj — AjI < \\[U ^{MH+G)U]j\\i < Vk\\[U ^{MH+ 
G)U]j \\2 < sep{D)/2. Therefore, the eigenvalues are distinct and we have 

|Aj — Xji\ > \Xj — Aj/| — |Aj — Xj\ > -\Xj — Xji\ > -sep{D). (14) 

Note that {Uj/} and {Uj} define two sets of basis vectors, thus we can write Uj = Ylj/CjiUji 
(with the correct permutation for columns of Uj and Uj) for some coefficients Cj, = 1. Apply 
first order Taylor expansion of eigenvector definition we have: 

XjUj = MUj = {M + {MH + G)) ^ CjMj> = ^ Xj^CjMj^ + {MH + G)Uj. 

j' j' 


Since we also have XjUj = Yhj' we can write ^ji{Xj — Xji)cjiUji = {MH+G)Uj, and we can 

solve for the coefficients Cj/’s from the linear system as [{Xj — Xj/)cji : j' G [A:]] = U~^{MH + G)Uj. 
Finally plug in the inequality in (|14[) we have that for any j: 


Wj - U,\\l = ^ c}\\Up\\l + {Cj - l)^\\Ujg 
j'yj 

<2E4ii'0'ii2 

j'M 

\U-\MH + G)Uj\\l 


< 


< 


sep{D)‘^ 

, {^max {^)^ max {H) + a max {G))\ 
amin{U)‘^ sep{D)‘^ 


rr.||2||w.||2 

J II 2 II ^] II 2 


□ 


3. Note that in the above bound for \\Uj — Uj\\^ we can bound the perturbation matrices H and 


Gby: 

^max{H) E 
^max{G) E 


_ S^ 11^211 

(1 ~ 0'max{E2 ^■^2))<7min(-E'2) <7mm(A'2) ~ l|-^2|| 

g'max(-^l) ^ _ 

<7mm(L/2) ^min{U )‘^ 0'rain{D2) 


11^211 

“ Crmin{U)‘^amin{D2) “ ||^ 2 || ’ 


Note that crmin{D 2 ) > Wmin and a m»U D) = 1 by definition. In the following claim, we apply 
anti-concentration bound to show that with high probability sep{D) is large. 

Claim 3.7. For any 6v G (0,1), with probability at least 1 — 5v, we can bound sep{D) by: 


sep{D) > 


y/dk'^' 
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Proof. Denote v = and note that ||f || < \f2. In the regime we con cern , for any pair 

we have that 


f / we have ^^>1 < | <; ^0) _ ^U')^y > |. Apply Lemma 

for 5 G (0,1), 


4.3 


< \ v > 


<||/,(i)_^(/)||_)<<5. 


Take a union bound over all pairs of j ^ j', we have that 

P ^for somej / | < > \ < < k'^-^ = S. 

Recall that A = 


□ 


4. Recall that U = P^V . Note that since P has orthonormal columns, we have crmin{U) = 
CTminiV) and \\Ui\\ < ||Ri|| = y/m. 

Finally we apply perturbation bound to the estimates Vi = PUi and conclude with the above 
inequalities; 


|R* - Rill < 2(||p - PllllPill + ||P||||Pi - UiW) 

^7.\fkn CTmax(P)<7max(D) P (^maxiG') 


< 2 


< 2 


{Vf 

ezV^ 


+ 3- 


+ 6 - 


(Tmin{U)sep{D) 

II^2||||R|| 


IIRII HR 


<c^( 


^Wmin(^min(V)'^ {Omin (R)2 

^min P2)- 11^211) ^ min {V)sep{D) 

^/dk‘^m WmaxCOn(l2{V)‘^ 


'^min^rniniV)^ 




for some universal constant C. Note that the last inequality used the assumption that is small 
enough. □ 

Condition number of Vs''- The following lemma is helpful: 

Lemma 3.8. Let Vs' G (^(m+c^+ilx*: ^/^g factor as defined in pTj ). Reeall that Vs' = [R 5 ;Rd;l], 

where is defined in Q, and Vs is the charaeteristic matrix defined in (10). 

We ean bound the condition number of Vs' by 


cond2{Vs') < \/l + Vkcond2{Vs). 


(15) 


Proof, (of Lemma 3.8) By definition, there exist some constants A and X' such that cond 2 (R 5 ) = 
A'/A, and for all w G Pi 2 , we have A < ||Rstc|| < A'. Note that each element of the factor V 5 / lies 
on the unit circle in the complex plane, then we have: 

A^ < liRs'fi'lli < llRs'^^lli ^ (-^0^ + '^d. 

We can bound the condition number of Vg' by: 

cond2(R5') < ^ 1 + ^^cond2(R5) < \JI + '/kcond2{Vs), 


where the last inequality is because that max^u ||R 5 tc ||2 > liRseilli = d, we have (A')^ > d. 


□ 
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Condition number of the characteristic matrix Vs’ Therefore, the stability analysis of 
the proposed algorithm boils down to understanding the relation between the random sampling set 
S and the condition number of the characteristic matrix Vs- This is analyzed in Lemma 3.10 (main 
technical lemma). 

Lemma 3.9. For any fixed number ex G (0,1/2). Consider a Gaussian vector s with distribution 

R‘^Idxd); where R > Jqj- d > 2, and R > ^ _ 2 . Define the 

Hermitian random matrix Xg G ^herm 




— ,S> 

—,s> 






(16) 


ITe can bound the spectrum ofKs[Xs] by: 

(1 “ ^x)Ikxk ^ IEs[^s] ^ (1 T ^x)Ikxk- 


(17) 


Proof, (of Lemma 3.9) Denote Y = Es[Xs]. Note that Yjj = 1 for all diagonal entries. For d = 1 
case, the point sources all lie on the interval [—1,1], we can bound the summation of the off diagonal 
entries in the matrix Y by: 


j'¥=j 




-VirARfi I -i(7r(2A)i?)2 _, - i (7r(fc/2) ^ 


= 2 

<2{e~^^'‘^^> +e"20‘v^^m; - 

^ ^x- 

For d> 2 case, we simply bound each off-diagonal entries by: 

Yjj> = < ex/k. 


Apply Lemma 4.2 (Gershgorin’s Disk Theorem) and we know that all the eigenvalues of Y are 
bounded by 1 ± e^,. □ 


Lemma 3.10 (Main technical lemma). In the same setting of Lemma 3.9, Let S = ..., 

be m independent samples of the Gaussian vector s. For m > ^ySlog^, with probability at least 
1 — 5s over the random sampling, the condition number of the factor Vs is bounded by: 


cond2{Vs) < 


1 + ‘2.ex 
1 - 2e, 


(18) 


Proof, (of Lemma 3.10) Let ..., denote the i.i.d. samples of the random matrix Xg 

les in S. 

m ^ 

WVswWi = w ' V^Vsw = n;' I £ A« 


defined in (16), with s evaluated at the i.i.d. random samples in S. Note that we have 

' 1 "" 


w. 


m 


2 = 1 
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By definition of condition number, to show that cond 2 (V 5 ) < , it suffices to show that 


(l-2e,)4xfc^ ^(l + 2ej4xfc. 


Z=1 


By Lemma 3.9, the spectrum of lies in (1 — e^,, 1 + ex)- Here we only need to show that 

the spectrum of the sample mean is close to the spectrum of the expectation Es[Xs]. 

Since each element of the random matrix Xg G lies on the unit circle in the complex plane, we 


have Xg ^ k^I almost surely. Therefore we can apply Lemma 4.1 (Matrix Hoeffding) to show that 
for m > ^-y/slog ^, with probability at least 1—4; h holds that ||^ < ex- □ 


4 Discussions 

4.1 Numerical results 

We empirically demonstrate the performance of the proposed super-resolution algorithm in this 
section. 

First, we look at a simple instance with dimension d = 2 and the minimal separation A = 0.05. 
Our perturbation analysis of the stability result limits to small noise, i.e. e^ is inverse polynomially 
small in the dimensions, and the number of measurements m needs to be polynomially large in the 
dimensions. However, we believe these are only the artifact of the crude analysis, instead of being 
intrinsic to the approach. In the following numerical example, we examine a typical instance of 8 
randomly generated 2-D point sources. The minimal separation A is set to be 0.01, and the weights 
are uniformly distributed in [0.1,1.1] The measurement noise level e^ is set to be 0.1, and we take 
only 2178 noisy measurements (<C 1/A^). Figure shows reasonably good recovery result. 



Figure 1: The xy plane shows the coordinates of the point sources: true point sources (cyan), the 
two closest points (blue), and the estimated points (red); the z axis shows the corresponding mixing 
weights. Dimension d = 2, number of point sources k = 8, minimal separation A = 0.05 and the 
measurement noise level = 0.1; we set the cutoff frequency R = 200 (in the same order as 1/A), 
take 2178 random measurements (<C 1/A^). 

Next, we examine the phase transition properties implied by the main theorem. 
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Figure shows the dependency between the cutoff frequency and the minimal separation. For 
each hxed pair of the minimal separation and the cutoff frequency {A,R), we randomly generate 
k = 8 point sources in 4-dimensional space while maintaining the same minimal separation. The 
weights are uniformly distributed in [0.1,1.!]. The recovery is considered successful if the error 


Ejsin < 0.1 (on average it tolerates around 4% error per coordinate per point 

source). This process is repeated 50 times and the rate of success was recorded. Figureplots the 
success rate in gray-scale, where 0 is black and 1 is white. 

We observe that there is a sharp phase transition characterized by a linear relation between the 


cutoff frequency and the inverse of minimal separation, which is implied by Theorem 3.2 


140 
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Figure 2: Fix dimension d = 4, number of point sources k = 8, number of measurements m = k'^, 
and the measurement noise level = 0.02. We vary the minimal separation such that A ranges 
from 0.005 to 0.1, and we vary the cutoff frequency R from 0 to 25. For each pair of {^,R) we 
randomly generate k point sources and run the proposed algorithm to recover the point sources. 

The recovery is considered successful if the error Ylj£[k] \J~ Hi — ^-l. This process is 
repeated 50 times and the rate of success was recorded. 

In a similar setup, we examine the success rate while varying the minimal separation A and the 
number of measurement m. 

In Figure]^ we observe that there is a threshold of m below which the number of measurements 
is too small to achieve stable recovery; when m is above the threshold, the success rate increases 
with the number of measurements as the algorithm becomes more stable. However, note that given 
the appropriately chosen cutoff frequency R, the number of measurements required does not depend 
on the minimal separation, and thus the computation complexity does not depend on the minimal 
separation neither. 

4.2 Connection with learning GMMs 

One reason we are interested in the scaling of the algorithm with respect to the dimension d is that 
it naturally leads to an algorithm for learning Gaussian mixture models (GMMs). 

We hrst state the problem: given a number of N i.i.d. samples coming from a random one 
out of k Gaussian distributions in d dimensional space, the learning problem asks to estimate the 



5 10 15 20 

cutoff freq R 


16 

































0.18 


0.16 
< 0.14 

I 0.12 

cd 

g- 0.1 
I 0.08 

'g 

E 0.06 
0.04 
0.02 


Figure 3: Fix dimension d = 4, number of point sources k = 8, and the measurement noise 
level €z = 0.03. We vary the minimal separation such that A ranges from 0.01 to 0.2, and we 
use the corresponding cutoff frequency R = . We also vary the number of measurements 

m from 4 to 64. For each pair of (A, m) we randomly generate k point sources and run the 
proposed algorithm to recover the point sources. The recovery is considered successful if the error 

Eje[fc] -/ ll^(i) _ ^U) ||2 < 0.1. This process is repeated 50 times and the rate of success was recorded. 

means and the covariance matrices of these Gaussian components, as well as the mixing weights. 
We denote the parameters by {{wj, where the mean vectors € [—1,+!]'^, the 

covariance matrices G and the mixing weights Wj G M+. Learning mixture of Gaussians 
is a fundamental problem in statistics and machine learning, whose study dates back to Pearson [18] 
in the 1900s, and later arise in numerous areas of applications. 

In this brief discussion, we only consider the case where the components are spherical Gaus¬ 
sians with common covariance matrices, namely = cr'^Idxd for all j. Moreover, we define the 
separation Ac by: 

a 

and we will focus on the well-separated case where Aq is sufficiently large. This class of well- 
separated GMMs is often used in data clustering. 

By the law of large numbers, for large d, the probability mass of a d-dimensional Gaussian 
distribution tightly concentrates within a thin shell with a \fdo distance from the mean vector. 
This concentration of distance leads to a line of works of provably learning GMMs in the well- 
separated case, started by the seminal work of Dasgupta|6| (spherical and identical S, Ag > 
complexity poly{d, k)) and followed by works of Dasgupta &: Schulman [8| (spherical and 
identical S, d S> log(A:), Aq > n(d^/'^), complexity poly{d,k)), Arora & Kannan |2T| (general and 
identical S, Aq > n(d^/^) complexity 0{k‘^)). 

Instead of relying on the concentration of distance and use distance based clustering to learn 
the GMM, we observe that in the well-separated case the characteristic function of the GMM has 
nice properties, and one can exploit the concentration of the characteristic function to learn the 
parameters. Note that we do not impose any other assumption on the dimensions k and d. 



number of measurements, m 
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Next, we sketch the basic idea of applying the proposed super-resolution algorithm to learn 
well-separated GMMs, guaranteeing that N the required number of samples from the GMM, as 
well as the computation complexity both are in the order of poly{d, k). Since a is a bounded scalar 
parameter, we can simply apply grid-search to find the best match. In the following we assume 
that the a is given and focus on learning the mean vectors and the mixing weights. 

Evaluate the characteristic function of a d dimensional Gaussian mixture X, with identical and 
spherical covariance matrix S = cr^Idxd, at s G 

j^[k] 


Also we let (pxis) denote the empirical characteristic function evaluated at s based on N i.i.d. 
samples {xi,... xx} drawn from this GMM: 

ie[N] 


Note that = 1 for all samples, thus we can apply Bernstein concentration inequality to 

the characteristic function and argue that |</>x('S) — </>Jv(s)| < for all s. 

In order to apply the proposed super-resolution algorithm, define 


f{s) = and f{s) 

j^[k] 




^x{s). 


In the context of learning GMM, taking measurements of f{s) corresponding to evaluating the 
empirical characteristic function at different s, for ||s||oo < R, where R is the cutoff frequency. 
Note that this implies ||s ||2 < dR^. Therefore, we have that with high probability the noise level 
€z can be bounded by 

^ / a^dB? 

ez = max |/(s) - f{s)\ = O 

ll^lloo^-R \ y I\ 

In order to achieve stable recovery of the mean vector using the proposed algorithm, on one 

hand, we need the cutoff frequency R = Vt[\/ gXq)] on the other hand, we need the noise level 
ez = o(l). It suffices to require a^dR? = o(l), namely having large enough separation > Q.{d}/'^). 
In summary, when the separation condition is satisfied, to achieve target accuracy in estimating 
the parameters, we need the noise level to be upper bounded by some inverse polynomial in the 
dimensions, and this is equivalent to requiring the number of samples from the GMM to be lower 
bounded by poly{k, d). 

Although this algorithm does not outperform the scaling result in Dasgupta[6], it still sheds light 
on a different approach of learning GMMs. We leave it as future work to apply super-resolution 
algorithms to learn more general cases of GMMs or even learning mixtures of log-concave densities. 



4.3 Open problems 

In a recent work, Chen &: Chi [5] showed that via structured matrix completion, the sample com¬ 
plexity for stable recovery can be reduced to 0{klog^ d). However, the computation complexity is 
still in the order of 0{k'^) as the Hankel matrix is of dimension 0{k^) and a semidefinite program is 
used to complete the matrix. It remains an open problem to reduce the sample complexity of our 
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algorithm from 0{k‘^) to the information theoretical bound 0{k), while retaining the polynomial 
scaling of the computation complexity. 

Recently, Schiebinger et al [22] studied the problem of learning a mixture of shifted and re-scaled 
point spread functions f{s) = Ylj This model has the Gaussian mixture as a special 

case, with the point spread function being Gaussian point spread ip{s, 

We have discussed the connection between super-resolution and learning GMM. Another interesting 
open problem is to generalize the proposed algorithm to learn mixture of broader classes of nonlinear 
functions. 
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Auxiliary lemmas 


Lemma 4.1 (Matrix Hoeffding). Consider a set {X^^\ ..., of independent, random, Her- 

mitian matrices of dimension k x k, with identical distribution X. Assume that E[Y] is finite, and 
Y a^I for some positive constant a almost surely, then, for all e > 0, 


Pr 


m 


-E[Y] 


2 = 1 


> e 


< ke 8 o -2 


Lemma 4.2 (Gershgorin’s Disk Theorem). The eigenvalues of a matrix Y G are all contained 

in the following union of disks in the complex plane: Uj^^VfYjj, Rj), where disk V{a,b) = {x G 
: jjx — ajj < b} and Rj = 

Lemma 4.3 (Vector Random Projection). Let a G be a random vector distributed uniformly 
over ® vector v G C"*. For 6 G (0,1), we have: 


Pr ( j < a, u > 1 < 


<5 < <5 


\V\\2 

fern 

Proof. This follows the argument of Lemma 2.2 from Dasgupta & Gupta |^. Extension to complex 
number is straightforward as we can bound the real part and the imaginary part separately. □ 


21 








